
    Position score weighting technique for mining web content outliers.

    Existing methods for mining web content outliers use a stemming algorithm to preprocess the web documents before matching them against a domain dictionary. A stemming algorithm reduces derived words to their stem, base or root form, but it does not always leave a real word after suffix removal, which makes it difficult to match words in the full-word profile against the domain dictionary. This study therefore uses a stemmed domain dictionary and applies a Term Frequency with Position Score (TF.PS) weighting technique, derived from the TF.IDF weighting scheme of Information Retrieval (IR), in the dissimilarity measure phase to evaluate the efficiency of this technique for determining outliers in web content. The dataset is The 20 Newsgroups Dataset. The stemmed domain dictionary with the TF.PS weighting technique achieves up to 98.19% accuracy and 90% F1-measure, which is higher than previous techniques.
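
The matching of a stemmed document profile against a stemmed domain dictionary can be sketched as below. The abstract does not give the TF.PS formula, so the position factor here (earlier first occurrence gives a higher score) and the toy stemmer are purely illustrative assumptions, not the paper's method.

```python
def simple_stem(word):
    # Toy suffix-stripping stemmer (a stand-in for e.g. Porter stemming).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tf_ps(document_words, stemmed_dictionary):
    """Weight each dictionary term by term frequency times an assumed
    position score (earlier first occurrence -> higher weight)."""
    stems = [simple_stem(w.lower()) for w in document_words]
    n = len(stems)
    weights = {}
    for term in stemmed_dictionary:
        count = stems.count(term)
        if count == 0:
            continue
        first_pos = stems.index(term)
        position_score = (n - first_pos) / n  # assumption, not the paper's formula
        weights[term] = count * position_score
    return weights
```

Because both the document profile and the dictionary are stemmed, the mismatch described above (a stem that is not a real word failing to match a full-word dictionary entry) cannot occur.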

    Frequent Lexicographic Algorithm for Mining Association Rules

    Recent progress in computer storage technology has enabled many organisations to collect and store huge amounts of data, leading to a growing demand for new techniques that can intelligently transform massive data into useful information and knowledge. The concept of data mining has drawn the attention of the business community to techniques that can extract nontrivial, implicit, previously unknown and potentially useful information from databases. Association rule mining is a data mining technique that discovers strong association or correlation relationships among data. Association rule algorithms follow a two-phase procedure: the first phase finds all frequent patterns, and the second phase uses these frequent patterns to generate all strong rules. The common precision measures used to complete these phases are support and confidence. Intensive investigation over the past few years has shown that the first phase involves the major computational task. Although the second phase seems more straightforward, it can be costly, because the set of generated rules is normally large while only a small fraction of these rules is typically useful and important. In response to these challenges, this study is devoted to finding faster methods for searching frequent patterns and to discovering association rules in concise form. An algorithm called Flex (Frequent lexicographic patterns) is proposed to achieve good performance in searching frequent patterns. The algorithm constructs the nodes of a lexicographic tree that represent frequent patterns, using a depth-first strategy to mine the frequent patterns and a vertical counting strategy to compute their support. The mined frequent patterns are then used to generate association rules.
    Three models were applied in this task: a traditional model, a constraint model and a representative model, which produce three kinds of rules respectively: all association rules, association rules with one consequence, and representative rules. As an additional utility in the representative model, this study proposes a set-theoretical intersection to assist users in finding duplicated rules. Four datasets, taken from the UCI machine learning and domain theories repositories except for the pumsb dataset, were used in the experiments. The Flex algorithm and two existing algorithms, Apriori and DIC, were tested on these datasets under the same specification, and their extraction times for mining frequent patterns were recorded and compared. The experimental results showed that the proposed algorithm outperformed both existing algorithms, especially in the case of long patterns, and it also gave promising results for short patterns. Two of the datasets were then chosen for a further experiment on the scalability of the algorithms, increasing their number of transactions up to six times. The scale-up experiment showed that the proposed algorithm is more scalable than the existing algorithms. The implementation of the adopted theory of the representative model proved that this model is more concise than the other two, as shown by the number of rules generated by the chosen models. Besides yielding a small set of rules, the representative model also has the lossless-information and soundness properties, meaning that it covers all interesting association rules and forbids the derivation of weak rules. It is theoretically proven that the proposed set-theoretical intersection is able to assist users in identifying the duplicated rules that exist in the representative model.
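
The combination the abstract describes, a depth-first walk of a lexicographic tree with vertical (tidset-intersection) support counting, can be sketched generically as below. This is an Eclat-style illustration of that strategy, not the Flex implementation itself.

```python
def mine_frequent(transactions, min_support):
    """Return {frozenset(pattern): support} for all frequent patterns."""
    # Vertical layout: item -> set of transaction ids containing it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, prefix_tids, candidates):
        # Each node of the lexicographic tree is a (prefix, tidset) pair.
        for i, (item, tids) in enumerate(candidates):
            new_tids = prefix_tids & tids if prefix else tids
            if len(new_tids) >= min_support:
                pattern = prefix | {item}
                frequent[frozenset(pattern)] = len(new_tids)
                # Depth-first: extend only with lexicographically later items,
                # so each pattern is generated exactly once.
                extend(pattern, new_tids, candidates[i + 1:])

    extend(set(), set(), sorted(tidsets.items()))
    return frequent
```

Support comes from intersecting tidsets along a branch, so no extra database scans are needed once the vertical layout is built.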

    Training Process Reduction Based On Potential Weights Linear Analysis To Accelerate Back Propagation Network

    Learning is an important property of a Back Propagation Network (BPN): finding suitable weights and thresholds during training in order to improve training time as well as to achieve high accuracy. Currently, data pre-processing techniques such as dimension reduction of the input values and pre-training are the contributing factors in developing efficient techniques for reducing training time with high accuracy, while initialization of the weights remains an important issue: random initialization creates a paradox and leads to low accuracy with high training time. One good data pre-processing technique for accelerating BPN classification is dimension reduction, but it suffers from the problem of missing data. In this paper, we study current pre-training techniques and a new pre-processing technique called Potential Weight Linear Analysis (PWLA), which combines normalization, dimension reduction of the input values and pre-training. In PWLA, data pre-processing is first performed to generate normalized input values, which are then used by a pre-training technique to obtain the potential weights. After these phases, the dimension of the input value matrix is reduced using the real potential weights. For the experiments, the XOR problem and three datasets, SPECT Heart, SPECTF Heart and Liver Disorders (BUPA), are evaluated. Our results show that the new PWLA technique transforms a BPN into a new Supervised Multi Layer Feed Forward Neural Network (SMFFNN) model with high accuracy in one epoch, without a training cycle. PWLA also has the power of non-linear supervised and unsupervised dimension reduction, to be applied to other supervised multi-layer feed-forward neural network models in future work.
    Comment: 11 pages, IEEE format, International Journal of Computer Science and Information Security, IJCSIS 2009, ISSN 1947 5500, Impact factor 0.42

    Term frequency-information content for focused crawling to predict relevant web pages.

    With the rapid growth of the Web, finding desirable information on the Internet is a tedious and time-consuming task. Focused crawlers are the golden key to solving this issue through mining of the Web content. In this regard, a variety of methods have been devised and implemented. Many of these methods, coming from an information retrieval viewpoint, are not biased towards the more informative terms in multi-term topics (topics with more than one keyword). In this paper, by considering terms’ information content, we propose the Term Frequency-Information Content (TF-IC) method, which assigns an appropriate weight to each term in a multi-term topic. Through the conducted experiments, we compare our method with other methods such as Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Semantic Indexing (LSI). Experimental results show that our method outperforms those two methods by retrieving more relevant pages for multi-term topics.
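
Weighting the terms of a multi-term topic by their information content can be sketched as follows. The abstract does not give the TF-IC formula; the sketch uses a common formulation of information content, IC(t) = -log2 p(t), with p(t) estimated from document frequencies in some reference corpus, so rarer (more informative) terms receive a higher share of the topic weight.

```python
import math

def information_content(term, doc_freq, total_docs):
    # Unseen terms are smoothed to a document frequency of 1.
    p = doc_freq.get(term, 1) / total_docs
    return -math.log2(p)

def topic_term_weights(topic_terms, doc_freq, total_docs):
    """Give rarer (more informative) topic terms a higher weight."""
    ic = {t: information_content(t, doc_freq, total_docs) for t in topic_terms}
    total = sum(ic.values())
    return {t: v / total for t, v in ic.items()}  # normalized to sum to 1
```

A plain TF scheme would treat both terms of a topic like "solar energy" equally; the information-content weighting biases the crawler's relevance score toward the rarer, more topic-specific term.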

    Data Stream Clustering: Challenges and Issues

    Very large databases are required to store the massive amounts of data that are continuously inserted and queried. Analyzing huge data sets and extracting valuable patterns is of interest to researchers in many applications. Two main groups of techniques for mining huge databases can be identified: one treats the data as a stream and applies mining techniques to it, whereas the other attempts to solve the problem directly with efficient algorithms over the entire database. Recently, many researchers have focused on data streams as an efficient strategy for mining huge databases instead of mining the entire database. The main problem in data stream mining is that evolving data is difficult to detect with these techniques, so unsupervised methods should be applied; clustering techniques in particular can lead us to discover hidden information. In this survey, we try to clarify: first, the different problem definitions related to data stream clustering in general; second, the specific difficulties encountered in this field of research; third, the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and how several prominent solutions tackle different problems.
    Index Terms: Data Stream, Clustering, K-Means, Concept drift
    Comment: IMECS201
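
One simple instance of the streaming strategy the survey covers is sequential (online) k-means, which updates cluster centroids one point at a time instead of storing the whole stream. A minimal 1-D sketch, with the centroid seeds as an illustrative assumption:

```python
def online_kmeans(stream, centroids):
    """Consume an iterable of 1-D points, updating centroids in place."""
    counts = [0] * len(centroids)
    for x in stream:
        # Assign the point to its nearest centroid.
        j = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        counts[j] += 1
        # Move that centroid toward the point with a decaying step size,
        # so each centroid ends up at the running mean of its points.
        centroids[j] += (x - centroids[j]) / counts[j]
    return centroids, counts
```

Only the centroids and counts are kept in memory, which is what makes this style of algorithm viable when the database is too large, or arrives too fast, to be scanned repeatedly. The decaying step size also illustrates the concept-drift trade-off listed in the index terms: a fixed step size would instead track a drifting distribution.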

    Classic term weighting technique for mining web content outliers

    Outlier analysis has become a popular topic in the field of data mining, but there has been less work on how to detect outliers in web content. Mining web content outliers detects irrelevant web content within a web portal. Term Frequency (TF) techniques from Information Retrieval (IR) have been used to detect the relevancy of a term in a web document; however, when document length varies, relative frequency is preferred. This study uses maximum frequency normalization and applies the Inverse Document Frequency (IDF) weighting technique, a traditional term weighting method in IR, to exploit the value of less frequent terms among documents, which are considered more discriminative than frequent terms. The dataset is The 20 Newsgroups Dataset. TF.IDF is used in the dissimilarity measure, and the result achieves up to 91.10% accuracy, about 17.77% higher than the previous technique.
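
The classic weighting scheme the abstract describes, term frequency normalized by the document's maximum term frequency, multiplied by inverse document frequency, can be sketched as:

```python
import math

def tf_idf(term, document, corpus):
    """document: list of terms; corpus: list of such documents."""
    freq = document.count(term)
    # Maximum-frequency normalization handles varying document lengths.
    max_freq = max(document.count(t) for t in set(document))
    tf = freq / max_freq
    # IDF rewards terms that occur in few documents (more discriminative).
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

A term that appears in every document gets an IDF of log(N/N) = 0 and so contributes nothing to the dissimilarity measure, which is exactly the intended bias toward less frequent, more discriminative terms.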

    An integrative cancer classification based on gene expression data

    The advent of the integrative approach has shifted the cancer classification task from purely data-centric to one incorporating prior biological knowledge. Integrative analysis of gene expression data with multiple biological sources is viewed as a promising approach to classify cancers and to reveal relevant cancer-specific biomarker genes. The identified biomarker genes can serve as a powerful tool for understanding complex biological mechanisms, and also for the diagnosis and treatment of cancer. However, most integrative classifiers incorporate only a single type of biological knowledge with the gene expression data within the same analysis. For instance, gene expression data is normally integrated with functional ontology, metabolic pathways, or protein-protein interaction networks, which are then analysed separately rather than simultaneously. Apart from that, current methods generate a large number of candidate genes, which still require further experiments and testing to identify the potential biomarker genes. Hence, this study aims to resolve these problems by proposing a systematic integrative framework for cancer gene expression analysis applied to the classification task. The association-based framework is capable of integrating and analysing multiple prior biological sources simultaneously. A set of biomarker genes relevant to the cancer diseases of interest is identified in order to improve classification performance and its interpretability. In this paper, the proposed approach is tested on a breast cancer microarray dataset integrated with protein interaction and metabolic pathway data. The results show that classification accuracy improves when both protein and pathway information are integrated into the microarray data analysis.

    Intrusion Detection System with Data Mining Approach: A Review

    Despite the widespread growth of information technology, security has remained a challenging area for computers and networks. Recently, many researchers have focused on intrusion detection systems based on data mining techniques as an efficient strategy. The main problem for an intrusion detection system is the accuracy of detecting new attacks, so unsupervised methods should be applied. On the other hand, intrusions must be recognized in real time, although an intrusion detection system is also helpful in off-line mode for removing weaknesses in a network's security. Data mining techniques can lead us to discover hidden information in a network's log data. In this survey, we try to clarify: first, the different problem definitions with regard to network intrusion detection in general; second, the specific difficulties encountered in this field of research; third, the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and how several prominent solutions tackle different problems.

    Expectation maximization clustering algorithm for user modeling in web usage mining system

    To provide intelligent personalized online services such as web recommender systems, it is usually necessary to model users’ web access behavior. One of the promising approaches to achieve this is web usage mining, which mines web logs for user models and recommendations. Web usage mining algorithms have been widely utilized for modeling user web navigation behavior. In this study we advance a model for mining users’ navigation patterns. The model is based on the expectation-maximization (EM) algorithm, which finds maximum likelihood estimates of parameters in probabilistic models that depend on unobserved latent variables. The experimental results show that as the number of clusters decreases, the log likelihood converges toward lower values, and that the probability of the largest cluster decreases as the number of clusters increases in each treatment. The results also indicate that the behavior captured by the EM clustering algorithm improves the visit-coherence (accuracy) of navigation pattern mining.
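
The latent-variable estimation the abstract refers to can be sketched with EM on a two-component 1-D Gaussian mixture. For brevity this illustrative version re-estimates only the component means, holding the variance and mixing weight fixed (a full EM would update those too); the initial means are assumptions.

```python
import math

def em_gmm(data, means, iters=50, var=1.0, weight=0.5):
    """Fit the two component means of a 1-D Gaussian mixture by EM."""
    m1, m2 = means
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        # (posterior probability of the latent cluster assignment).
        resp = []
        for x in data:
            p1 = weight * math.exp(-(x - m1) ** 2 / (2 * var))
            p2 = (1 - weight) * math.exp(-(x - m2) ** 2 / (2 * var))
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate each mean as a responsibility-weighted average.
        r1 = sum(resp)
        m1 = sum(r * x for r, x in zip(resp, data)) / r1
        m2 = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - r1)
    return m1, m2
```

Each iteration is guaranteed not to decrease the log likelihood, which is the convergence behavior the experiments above track across different cluster counts.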

    A Framework For Intelligent Multi Agent System Based Neural Network Classification Model

    Intelligent multi-agent systems have great potential for use in different purposes and research areas. One of the important issues in applying intelligent multi-agent systems in the real world and in virtual environments is to develop a framework that supports a machine learning model reflecting the whole complexity of the real world. In this paper, we propose a framework for an intelligent-agent-based neural network classification model to close the gap between two applicable flows: intelligent multi-agent technology and models that learn from the real environment. We consider the new Supervised Multilayer Feed Forward Neural Network (SMFFNN) model as the intelligent classifier for the learning model in the framework. The framework obtains information from the respective environment, whose behavior can be recognized through the weights. Therefore, the SMFFNN model that lies in the framework gives more benefit in finding the suitable information and the real weights from the environment, resulting in better recognition. The framework is applicable to different domains; as a potential case study, a clinical organization and its domain are considered for the proposed framework.
    Comment: 7 pages, IEEE format, International Journal of Computer Science and Information Security, IJCSIS 2009, ISSN 1947 5500, Impact Factor 0.423, http://sites.google.com/site/ijcsis